1. Libraries

This section loads the necessary R packages for data manipulation, visualization, and other tasks. If a package is not installed, it installs it first and then loads it.

2. Dataset

This section for loading the dataset from the provided URL and displaying a summary of the data.

Here are some of the key attributes:

StichtagDatJahr: Year of the data record HalterId: Identifier for the pet owner AlterV10Cd, AlterV10Lang, AlterV10Sort: Codes, descriptions, and sorting for the age group of the pet owner SexCd, SexLang, SexSort: Codes, descriptions, and sorting for the gender of the pet owner KreisCd, KreisLang, KreisSort: Codes, descriptions, and sorting for the district QuarCd, QuarLang, QuarSort: Codes, descriptions, and sorting for the quarter Rasse1Text, Rasse2Text: Primary and secondary breed texts of the pet RasseMischlingCd, RasseMischlingLang, RasseMischlingSort: Codes, descriptions, and sorting for whether the pet is a mixed breed RassentypCd, RassentypLang, RassentypSort: Codes, descriptions, and sorting for the type of breed GebDatHundJahr: Year of birth of the pet AlterVHundCd, AlterVHundLang, AlterVHundSort: Codes, descriptions, and sorting for the age group of the pet SexHundCd, SexHundLang, SexHundSort: Codes, descriptions, and sorting for the gender of the pet HundefarbeText: Color of the pet AnzHunde: Number of dogs

3. Translation

In this section, we duplicate and rename the dataframe df as df_EN for the English version. Then, translations for column names in English are defined. Following this, a function is employed to replace multiple patterns at once for content translation. Patterns and replacements for content translation, including translations for age groups, sexes, breed types, and dog colors, are defined. After applying the translation function across all columns, dog colors are also translated.

4. Unique Owner IDs

In the following R code snippet, we implement a method to distinguish unique OwnerId values within our dataset. By marking the initial occurrence of each OwnerId as unique, we facilitate further analyses that may require the identification of distinct entries.

df_EN$unique_OwnerId <- !duplicated(df_EN$OwnerId)
View(df_EN)

5. Refined Dataset

The R code below demonstrates the process of extracting a subset of relevant columns from our comprehensive dataset df_EN, thereby creating a streamlined DataFrame, new_df. This subset includes essential fields such as KeyDateYear, OwnerId, and details regarding the dogs such as PrimaryBreed and DogBirthYear. Additionally, the code converts the NumberOfDogs column from its original format to a numeric type, ensuring that subsequent data analysis can utilize numerical operations.

# Create a new DataFrame with selected columns and convert 'NumberOfDogs' to numeric
new_df <- df_EN %>%
  select(KeyDateYear, OwnerId, OwnerAgeGroup, OwnerSex, DistrictSort, QuarCd, PrimaryBreed, SecondaryBreed, MixedBreed, BreedType, DogBirthYear,    DogSex, NumberOfDogs, unique_OwnerId) %>%
  mutate(NumberOfDogs = as.numeric(as.character(NumberOfDogs)))

View(new_df)

6. Analyzing Diversity in Dataset Features: Years, Owner IDs, and Age Groups

This series of R code snippets delves into the examination of key features within the new_df DataFrame, focusing on the identification and analysis of unique entries for KeyDateYear, OwnerId, and OwnerAgeGroup. Each code section is designed to extract unique values, count these entries, and where applicable, visualize the distribution. Such analysis is integral for understanding the dataset’s diversity across different dimensions, helping to highlight temporal coverage, ownership uniqueness, and demographic variations among owners.

# Extract and count unique years
unique_years <- unique(new_df$KeyDateYear)
number_of_unique_years <- length(unique_years)
print(number_of_unique_years)
## [1] 9
print(unique_years)
## [1] "2015" "2016" "2017" "2018" "2019" "2020" "2021" "2022" "2023"
# Extract and count unique Owner IDs
unique_Owner <- unique(new_df$OwnerId)
number_of_unique_Owner <- length(unique_Owner)
print(number_of_unique_Owner)
## [1] 15504
# Aggregate unique Owner IDs by Age Group and Year correctly with n_distinct
unique_owner_counts <- new_df %>%
  group_by(KeyDateYear, OwnerAgeGroup) %>%
  summarise(UniqueOwnerCount = n_distinct(OwnerId), .groups = 'drop')  # Count unique IDs per group per year

# Print the aggregated results
print(unique_owner_counts)
## # A tibble: 90 × 3
##    KeyDateYear OwnerAgeGroup      UniqueOwnerCount
##    <chr>       <chr>                         <int>
##  1 2015        10 to 19 years old               23
##  2 2015        20 to 29 years old              604
##  3 2015        30 to 39 years old             1173
##  4 2015        40 to 49 years old             1287
##  5 2015        50 to 59 years old             1375
##  6 2015        60 to 69 years old              954
##  7 2015        70 to 79 years old              647
##  8 2015        80 to 89 years old              191
##  9 2015        90 to 99 years old               19
## 10 2015        Unknown                          43
## # ℹ 80 more rows
# Adjust factor levels in the aggregated data before plotting
unique_owner_counts <- unique_owner_counts %>%
  arrange(desc(UniqueOwnerCount)) %>%
  mutate(OwnerAgeGroup = fct_inorder(OwnerAgeGroup))

# Get a list of unique years
years <- unique(unique_owner_counts$KeyDateYear)

# Loop through each year and create a plot, ordering by max number in the group
for (year in years) {
  # Filter data for the specific year
  data_for_year <- filter(unique_owner_counts, KeyDateYear == year)
  
  # Create the plot
  p <- ggplot(data_for_year, aes(x = OwnerAgeGroup, y = UniqueOwnerCount, fill = OwnerAgeGroup)) +
    geom_bar(stat = "identity", position = "dodge") +
    geom_hline(yintercept = c(100, 500, 1000, 1500, 2000), linetype = "dashed", color = "red") + 
    theme_minimal() +
    labs(title = paste("Unique Owner IDs by Age Group in", year),
         x = "Owner Age Group",
         y = "Count of Unique Owner IDs") +
    scale_fill_brewer(palette = "Paired") +
    scale_y_continuous(limits = c(0, 2500), breaks = seq(0, 2500, by = 500)) + # Setting uniform y-axis scale
    scale_x_discrete(labels = function(x) {
      # Remove age numbers and the word "Unknown"
      x <- gsub("[0-9]+ to [0-9]+ years old", "", x)
      gsub("Unknown", "", x)
    })

  # Print the plot
  print(p)
}

# Aggregate unique Owner IDs by Age Group, Year, and Gender
unique_owner_counts_gender <- new_df %>%
  group_by(KeyDateYear, OwnerAgeGroup, OwnerSex) %>%
  summarise(UniqueOwnerCountGender = n_distinct(OwnerId), .groups = 'drop')  # Count unique IDs per group per year by gender

# Print the aggregated results
print(unique_owner_counts)
## # A tibble: 90 × 3
##    KeyDateYear OwnerAgeGroup      UniqueOwnerCount
##    <chr>       <fct>                         <int>
##  1 2023        30 to 39 years old             2376
##  2 2022        30 to 39 years old             2180
##  3 2023        40 to 49 years old             1930
##  4 2021        30 to 39 years old             1891
##  5 2022        40 to 49 years old             1813
##  6 2023        50 to 59 years old             1689
##  7 2022        50 to 59 years old             1688
##  8 2021        40 to 49 years old             1631
##  9 2020        30 to 39 years old             1607
## 10 2021        50 to 59 years old             1598
## # ℹ 80 more rows
print(unique_owner_counts_gender)
## # A tibble: 180 × 4
##    KeyDateYear OwnerAgeGroup      OwnerSex UniqueOwnerCountGender
##    <chr>       <chr>              <chr>                     <int>
##  1 2015        10 to 19 years old female                       17
##  2 2015        10 to 19 years old male                          6
##  3 2015        20 to 29 years old female                      467
##  4 2015        20 to 29 years old male                        137
##  5 2015        30 to 39 years old female                      797
##  6 2015        30 to 39 years old male                        376
##  7 2015        40 to 49 years old female                      838
##  8 2015        40 to 49 years old male                        449
##  9 2015        50 to 59 years old female                      941
## 10 2015        50 to 59 years old male                        434
## # ℹ 170 more rows
# Get a list of unique years
years <- unique(unique_owner_counts$KeyDateYear)

# Loop through each year and create a plot with consistent y-axis scale
for (year in years) {
  # Filter data for the specific year for both total counts and gender-specific counts
  data_for_year_total <- filter(unique_owner_counts, KeyDateYear == year)
  data_for_year_gender <- filter(unique_owner_counts_gender, KeyDateYear == year)

  # Create the plot with bars for totals and points for gender
  p <- ggplot(data_for_year_total, aes(x = OwnerAgeGroup, y = UniqueOwnerCount)) +
    geom_bar(stat = "identity", position = "dodge", aes(fill = OwnerAgeGroup)) +
    geom_point(data = data_for_year_gender, 
               aes(x = OwnerAgeGroup, y = UniqueOwnerCountGender, group = OwnerSex, color = OwnerSex),
               position = position_dodge(width = 0.9), size = 3) +
    geom_hline(yintercept = c(100, 500, 1000, 1500), linetype = "dashed", color = "red") +
    theme_minimal() +
    labs(title = paste("Unique Owner IDs by Age Group and Gender in", year),
         x = "Owner Age Group",
         y = "Count of Unique Owner IDs") +
    scale_fill_brewer(palette = "Paired") +  # Color for bars
    scale_color_manual(values = c("female" = "pink", "male" = "blue")) +  # Color for points
    scale_y_continuous(limits = c(0, 2500), breaks = seq(0, 2500, by = 500)) + 
    scale_x_discrete(labels = function(x) {
      # Remove age numbers and the word "Unknown"
      x <- gsub("[0-9]+ to [0-9]+ years old", "", x)
      gsub("Unknown", "", x)
    })
  
  # Print the plot
  print(p)
}

# Adjust factor levels in the aggregated data before plotting
unique_owner_counts <- unique_owner_counts %>%
  arrange(desc(UniqueOwnerCount)) %>%
  mutate(OwnerAgeGroup = fct_inorder(OwnerAgeGroup),
         KeyDateYear = as.numeric(as.character(KeyDateYear)))  # Convert KeyDateYear to numeric

# Create the line plot for all years with a line per age group
p <- ggplot(unique_owner_counts, aes(x = KeyDateYear, y = UniqueOwnerCount, group = OwnerAgeGroup, color = OwnerAgeGroup)) +
  geom_line(size = 1) +  # Add line
  geom_point(size = 3) +  # Add points
  geom_hline(yintercept = c(100, 500, 1000, 1500, 2000), linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Unique Owner IDs by Age Group Over Years",
       x = "Year",
       y = "Count of Unique Owner IDs") +
  scale_color_brewer(palette = "Paired") +  # Use color to differentiate lines
  scale_y_continuous(limits = c(0, 2500), breaks = seq(0, 2500, by = 500)) +
  scale_x_continuous(breaks = seq(min(unique_owner_counts$KeyDateYear), max(unique_owner_counts$KeyDateYear), by = 1))  # Define breaks in the x-axis scale to show each year
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Print the plot
print(p)

7. Yearly Dog Counts

After confirming successful conversion, we aggregated the data to compute the total number of dogs per year. The resulting counts were then visualized using histograms to illustrate the distribution over the years.

Furthermore, to understand the trend in dog population over time, we calculated the percentage change between consecutive years. This allowed us to identify any notable fluctuations or patterns in the data.

new_df <- new_df %>% 
  mutate(NumberOfDogs = as.numeric(NumberOfDogs))

# Check for any conversion problems
sum(is.na(new_df$NumberOfDogs))
## [1] 0
new_df$KeyDateYear <- as.numeric(as.character(new_df$KeyDateYear))

# Aggregate data to get total number of dogs per year
yearly_dog_counts <- new_df %>%
  group_by(KeyDateYear) %>%
  summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop')

print(yearly_dog_counts)
## # A tibble: 9 × 2
##   KeyDateYear TotalDogs
##         <dbl>     <dbl>
## 1        2015      6980
## 2        2016      6930
## 3        2017      7155
## 4        2018      7400
## 5        2019      7647
## 6        2020      7841
## 7        2021      8574
## 8        2022      9173
## 9        2023      9512
# Create the histogram
ggplot(yearly_dog_counts, aes(x = KeyDateYear, y = TotalDogs)) +
  geom_col(fill = "skyblue", color = "black") +  # Using geom_col for clarity
  geom_hline(yintercept = c(2500, 5000, 7500), linetype = "dashed", color = "red") +  
  theme_minimal() +
  labs(title = "Total Number of Dogs per Year",
       x = "Year",
       y = "Total Number of Dogs") +
  scale_x_continuous(breaks = yearly_dog_counts$KeyDateYear, 
                     labels = yearly_dog_counts$KeyDateYear) +  
  scale_y_continuous(labels = scales::comma, 
                     breaks = seq(0, 10000, by = 1000),
                     limits = c(0, 10000)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

### Calculate the PERCENTAGE CHANGE for annotations ###
yearly_dog_counts <- yearly_dog_counts %>%
  arrange(KeyDateYear) %>%
  mutate(Change = c(NA, diff(TotalDogs)),
         PercentChange = Change / lag(TotalDogs) * 100)

# Create the base plot
ggplot(yearly_dog_counts, aes(x = KeyDateYear, y = TotalDogs)) +
  geom_col(fill = "skyblue", color = "black") +
  geom_hline(yintercept = c(2500, 5000, 7500), linetype = "dashed", color = "red") +  # Dashed lines at specified counts
  geom_smooth(method = "lm", color = "red", linetype = "dashed", se = FALSE) +  # Add a linear trend line
  geom_text(data = yearly_dog_counts, aes(label = sprintf("%.1f%%", PercentChange)), 
            vjust = -1.5, hjust = 0.5, color = "darkgreen", size = 3.5) +
  theme_minimal() +
  labs(title = "Total Number of Dogs per Year",
       x = "Year",
       y = "Total Number of Dogs") +
  scale_x_continuous(breaks = yearly_dog_counts$KeyDateYear) +
  scale_y_continuous(labels = scales::comma, 
                     breaks = seq(0, 10000, by = 1000),  # Adjust y-axis to have units of 1000
                     limits = c(0, 10000)) +  # Ensure y-axis goes up to 10000
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")
## `geom_smooth()` using formula = 'y ~ x'

8. Total number of dogs per year by owner’s Gender

We explore the unique owners per year based on gender. The analysis starts by counting each owner once per year, regardless of the number of dogs they own.

# Step 1: Count each owner once per year regardless of the number of dogs
yearly_dog_counts_by_gender_unique <- new_df %>%
  select(KeyDateYear, OwnerId, OwnerSex) %>%
  distinct(KeyDateYear, OwnerId, OwnerSex) %>%
  group_by(KeyDateYear, OwnerSex) %>%
  summarize(UniqueOwners = n(), .groups = 'drop')

print(yearly_dog_counts_by_gender_unique)
## # A tibble: 18 × 3
##    KeyDateYear OwnerSex UniqueOwners
##          <dbl> <chr>           <int>
##  1        2015 female           4276
##  2        2015 male             2040
##  3        2016 female           4263
##  4        2016 male             2012
##  5        2017 female           4413
##  6        2017 male             2035
##  7        2018 female           4581
##  8        2018 male             2096
##  9        2019 female           4746
## 10        2019 male             2183
## 11        2020 female           4882
## 12        2020 male             2261
## 13        2021 female           5379
## 14        2021 male             2483
## 15        2022 female           5825
## 16        2022 male             2639
## 17        2023 female           6045
## 18        2023 male             2776
# Step 2: Create histograms for male and female owners without repetition
ggplot(yearly_dog_counts_by_gender_unique, aes(x = KeyDateYear, y = UniqueOwners, fill = OwnerSex)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme_minimal() +
  labs(title = "Unique Owners per Year by Gender",
       x = "Year",
       y = "Number of Unique Owners") +
  scale_fill_manual(values = c("male" = "blue", "female" = "pink")) +
  geom_hline(yintercept = c(2500, 5000, 7500), linetype = "dashed", color = "red") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_continuous(breaks = yearly_dog_counts_by_gender_unique$KeyDateYear) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

9. Heatmap of total number of dogs per year by owner’s Gender

We proceeded to create a heatmap illustrating the distribution of dogs based on the gender and age group of their owners across different years. First, we organized the data by grouping it according to the year, owner’s age group, and gender. Then, we iterated through each year, generating separate heatmaps to visualize the data for that specific year. Each heatmap represents the total number of dogs in various age groups categorized by the gender of their owners. The color gradient within the heatmap indicates the intensity of dog ownership, with warmer colors representing higher dog counts.

# Adjust the aggregation to count each owner once per age group and gender, per year
owner_counts <- new_df %>%
  select(KeyDateYear, OwnerAgeGroup, OwnerSex, OwnerId) %>%
  distinct(KeyDateYear, OwnerAgeGroup, OwnerSex, OwnerId) %>%
  group_by(KeyDateYear, OwnerAgeGroup, OwnerSex) %>%
  summarize(UniqueOwners = n(), .groups = 'drop')

# Loop through each year and create a heatmap
years <- unique(owner_counts$KeyDateYear)

for (year in years) {
  yearly_data <- filter(owner_counts, KeyDateYear == year)
  
  p <- ggplot(yearly_data, aes(x = OwnerSex, y = OwnerAgeGroup, fill = UniqueOwners)) +
    geom_tile() +
    scale_fill_gradientn(colors = brewer.pal(11, "Spectral"), limits = c(0, max(yearly_data$UniqueOwners, na.rm = TRUE)), name = "Total Dogs") +
    theme_minimal() +
    labs(title = paste("Heatmap of Unique Owners by Gender and Age Group in", year),
         x = "Owner's Gender",
         y = "Owner's Age Group",
         fill = "Number of Unique Owners") +
    theme(axis.text.y = element_text(angle = 45, hjust = 1))
  
  print(p)
}

# 10. Total Count of Dogs by District

10.1. General

We focus on examining the annual distribution of dog populations across various districts. To achieve this, we have implemented a specific R script that systematically processes and visualizes data for each year from our dataset.

# Adjusting DistrictSort to have levels from 1 to 12 as indicated
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))

# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)

# Iterate over each year and create a bar plot for districts 1 through 12, excluding NAs
for (year in unique_years) {
  yearly_data <- new_df %>%
    filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
    group_by(DistrictSort) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
    arrange(desc(TotalDogs))  # Arranging data by TotalDogs in descending order

  # Plot for the year with districts sorted by the number of dogs
  p <- ggplot(yearly_data, aes(x = reorder(DistrictSort, -TotalDogs), y = TotalDogs, fill = DistrictSort)) +
    geom_col() +  # Using geom_col for bar plots
    geom_hline(yintercept = c(100, 500, 1000, 1500), linetype = "dashed", color = "red") +  # Adding dashed lines at specified y-values
    scale_fill_viridis_d(name = "District") +
    scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, by = 500)) +  # Standardize y-axis up to 2000
    theme_minimal() +
    labs(title = paste("Total Count of Dogs by District in", year),
         x = "District",
         y = "Total Number of Dogs") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          legend.position = "none")  # Removing legend for clarity

  # Display the plot
  print(p)
}

10.2. By Dogs’ gender

To enhance our understanding of the distribution of dogs across different districts and to introduce a gender perspective into our analysis, we have modified our approach to include a breakdown by gender. This adjustment allows us to observe not only the geographical distribution but also gender dynamics within the dog population each year.

# Ensure DistrictSort is a factor with levels from 1 to 12
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))

# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)

# Iterate over each year and create a bar plot for districts 1 through 12, excluding NAs
for (year in unique_years) {
  yearly_data <- new_df %>%
    filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
    group_by(DistrictSort, DogSex) %>%  # Group by district and dog sex
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
    arrange(desc(TotalDogs))  # Arrange by TotalDogs in descending order for each group

  # Reorder DistrictSort based on TotalDogs for clearer visualization
  yearly_data$DistrictSort <- factor(yearly_data$DistrictSort, levels = unique(yearly_data$DistrictSort))

  # Plot for the year with districts 1 through 12, differentiating by gender
  p <- ggplot(yearly_data, aes(x = DistrictSort, y = TotalDogs, fill = DogSex)) +
    geom_col(position = position_dodge()) +  # Using geom_col with dodge to separate male and female bars
    geom_hline(yintercept = c(100, 500), linetype = "dashed", color = "red") +  # Adding dashed lines
    scale_fill_viridis_d(name = "Dog Gender") +
    scale_y_continuous(limits = c(0, 800), breaks = seq(0, 800, by = 100)) +
    theme_minimal() +
    labs(title = paste("Total Count of Dogs by District and Gender in", year),
         x = "District",
         y = "Total Number of Dogs") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          legend.position = "bottom")  # Adjust legend position for clarity
  
  print(p)  # Display the plot
}

10.3. By Unique Owners’ gender

# Ensure DistrictSort is a factor with levels from 1 to 12 in new_df
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))

# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)

# Iterate over each year and create a bar plot for districts 1 through 12, excluding NAs in DistrictSort
for (year in unique_years) {
  # Summarize total dogs by district
  yearly_total <- new_df %>%
    filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
    group_by(DistrictSort) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
    arrange(desc(TotalDogs))  # Arrange by TotalDogs in descending order for clarity in visualization
  
  # Summarize total dogs by district and gender
  yearly_gender <- new_df %>%
    filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
    group_by(DistrictSort, OwnerSex) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
    ungroup() %>%
    mutate(Position = as.numeric(DistrictSort) + ifelse(OwnerSex == "female", -0.2, 0.2))  # Adjust positions for clarity
  
  # Count unique owners by gender
  yearly_owner_count <- new_df %>%
    filter(KeyDateYear == year) %>%
    distinct(OwnerId, OwnerSex) %>%
    group_by(OwnerSex) %>%
    summarize(UniqueOwners = n(), .groups = 'drop')
  
  # Create the plot
  p <- ggplot() +
    geom_bar(data = yearly_total, aes(x = DistrictSort, y = TotalDogs, fill = DistrictSort), stat = "identity") +
    geom_point(data = yearly_gender, aes(x = Position, y = TotalDogs, color = OwnerSex), size = 3) +
    geom_text(data = yearly_owner_count, aes(x = Inf, y = Inf, label = paste(UniqueOwners, "unique owner(s)"), hjust = 1, vjust = 1), size = 3, color = "black") + # Add text for unique owner count
    geom_hline(yintercept = c(100, 500, 1000, 1500), linetype = "dashed", color = "red") +  # Adding dashed lines at specified y-values
    scale_fill_viridis_d() +
    scale_color_manual(values = c("female" = "pink", "male" = "blue")) +
    scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, by = 500)) +  # Standardize y-axis up to 2000
    theme_minimal() +
    labs(title = paste("Total Count of Dogs by District in", year, "— Female vs Male"),
         subtitle = "Bar: Total Count | Points: Count by Gender",
         x = "District",
         y = "Total Count of Dogs") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
  print(p)
}

10.4. By BreedType

In order to deepen our analysis of dog populations across different districts annually, our R script now incorporates an additional layer of granularity by assessing dog counts not only by district but also by breed type. This enhancement aims to provide a more detailed view of the diversity within the canine populations across the various districts each year.

# Adjusting DistrictSort to have levels from 1 to 12 as initially indicated
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))

# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)

# Iterate over each year and create a bar plot for districts 1 through 12
for (year in unique_years) {
  # Summarize total dogs by district
  yearly_total <- new_df %>%
    filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
    group_by(DistrictSort) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop')
  
  # Summarize total dogs by district and breed type, excluding "Unknown" breed type
  yearly_breed <- new_df %>%
    filter(KeyDateYear == year, !is.na(DistrictSort), BreedType != "Unknown") %>%
    group_by(DistrictSort, BreedType) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
    ungroup() %>%
    mutate(Position = as.numeric(DistrictSort))  # Position adjustment can be refined as needed
  
  # Create the plot
  p <- ggplot() +
    geom_bar(data = yearly_total, aes(x = DistrictSort, y = TotalDogs, fill = DistrictSort), stat = "identity") +
    geom_point(data = yearly_breed, aes(x = Position, y = TotalDogs, shape = BreedType), size = 3, position = position_jitterdodge(jitter.width = 0.2)) +
    geom_hline(yintercept = c(100, 500, 1000, 1500), linetype = "dashed", color = "red") +  # Adding dashed lines at specified y-values
    scale_fill_viridis_d(name = "District") +
    scale_color_viridis_d() +  # If BreedType needs color coding, this could be added/adjusted
    scale_shape_manual(values = seq(1, 20)) +  # Manually specifying shapes for breed type distinction
    scale_size(range = c(1, 6), name = "Total Dogs per Breed Type") +  # Adjusting point size to reflect total dogs per breed type
    scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, by = 500)) +  # Standardize y-axis up to 2000
    theme_minimal() +
    labs(title = paste("Total Count of Dogs by District in", year),
         subtitle = "Points indicate count by breed type",
         x = "District",
         y = "Total Count of Dogs") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
  print(p)
}

10.5. By BreedType and Dogs’gender

In our continuing efforts to provide a comprehensive analysis of the dog populations within various districts, our latest R script has been enhanced to include not only total counts by district but also a detailed breakdown by breed type and gender.

# Adjust DistrictSort to have levels from 1 to 12 as initially indicated in new_df
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))

# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)

for (year in unique_years) {
  # Calculate total number of dogs per district for each year
  yearly_total <- new_df %>%
    filter(KeyDateYear == year, !is.na(DistrictSort)) %>%
    group_by(DistrictSort) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop')
  
  # Calculate total number of dogs per district per breed type and gender
  yearly_breed_gender <- new_df %>%
    filter(KeyDateYear == year, !is.na(DistrictSort), BreedType != "Unknown") %>%
    group_by(DistrictSort, BreedType, DogSex) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
    ungroup() %>%
    mutate(Position = as.numeric(DistrictSort) + ifelse(DogSex == "female", -0.2, 0.2))  # Adjust position slightly for clarity
  
  # Plotting both total counts and counts by breed type and gender
  p <- ggplot() +
    geom_bar(data = yearly_total, aes(x = DistrictSort, y = TotalDogs, fill = DistrictSort), stat = "identity") +
    geom_point(data = yearly_breed_gender, aes(x = Position, y = TotalDogs, color = DogSex, shape = BreedType), size = 3, position = position_jitterdodge(jitter.width = 0.1)) +
    geom_hline(yintercept = c(100, 500, 1000, 1500), linetype = "dashed", color = "red") +
    scale_fill_viridis_d(name = "District") +
    scale_color_manual(values = c("female" = "pink", "male" = "blue")) +
    scale_shape_manual(values = seq(1, 20)) +
    scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, by = 500)) +
    theme_minimal() +
    labs(title = paste("Total Count of Dogs by District, Breed, and Gender in", year),
         subtitle = "Bar: Total Count | Points: Count by Breed and Gender",
         x = "District",
         y = "Total Count of Dogs") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

  print(p)
}
## Warning: `position_jitterdodge()` requires non-overlapping x intervals

## Warning: `position_jitterdodge()` requires non-overlapping x intervals

## Warning: `position_jitterdodge()` requires non-overlapping x intervals

## Warning: `position_jitterdodge()` requires non-overlapping x intervals

## Warning: `position_jitterdodge()` requires non-overlapping x intervals

## Warning: `position_jitterdodge()` requires non-overlapping x intervals

## Warning: `position_jitterdodge()` requires non-overlapping x intervals

## Warning: `position_jitterdodge()` requires non-overlapping x intervals

## Warning: `position_jitterdodge()` requires non-overlapping x intervals

11. Top Primary Breeds

##11.1. Including Unknown by year

# Data preprocessing with new transformations for new_df
new_df <- new_df %>%
  mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
         KeyDateYear = as.numeric(as.character(KeyDateYear)),
         DogAge = KeyDateYear - DogBirthYear)

# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)

# Function to assign colors to PrimaryBreed
assign_colors <- function(data) {
  n_breeds <- length(unique(data$PrimaryBreed))
  palette <- scales::hue_pal()(n_breeds)
  breed_color_map <- setNames(palette, unique(data$PrimaryBreed))
  return(breed_color_map)
}

# Iterate over each year and create a bar plot for primary breeds, excluding NAs
for (year in unique_years) {
  yearly_data <- new_df %>%
    filter(KeyDateYear == year, (DogAge >= 0 & DogAge <= 15) | is.na(DogAge)) %>%
    group_by(PrimaryBreed) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
    arrange(desc(TotalDogs)) %>%  # Arrange by TotalDogs in descending order
    slice_max(order_by = TotalDogs, n = 10)  # Select top 10 PrimaryBreed based on TotalDogs

  # Assign colors to PrimaryBreed
  breed_color_map <- assign_colors(yearly_data)
  
  # Plot for the year showing top 10 primary breeds
  p <- ggplot(yearly_data, aes(x = reorder(PrimaryBreed, TotalDogs), y = TotalDogs, fill = PrimaryBreed)) +
    geom_col() +  # Regular bar plot
    scale_fill_manual(values = breed_color_map) +  # Set manual colors for PrimaryBreed
    geom_hline(yintercept = c(250, 500), linetype = "dashed", color = "red") +  # Adding dashed lines at specified y-values
    scale_y_continuous(limits = c(0, max(750, max(yearly_data$TotalDogs))), breaks = seq(0, max(750, max(yearly_data$TotalDogs)), by = 250)) +  # Adjust y-axis based on max dog count
    theme_minimal() +
    labs(title = paste("Top 10 Primary Breeds by Total Count of Dogs in", year),
         x = "Primary Breed",
         y = "Total Number of Dogs") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

  print(p)  # Display the plot
}

##11.2. Excluding Unknown by year

# Data preprocessing with new transformations for new_df
new_df <- new_df %>%
  mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
         KeyDateYear = as.numeric(as.character(KeyDateYear)),
         DogAge = KeyDateYear - DogBirthYear)

# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)

# Function to assign colors to PrimaryBreed
assign_colors <- function(data) {
  n_breeds <- length(unique(data$PrimaryBreed))
  palette <- scales::hue_pal()(n_breeds)
  breed_color_map <- setNames(palette, unique(data$PrimaryBreed))
  return(breed_color_map)
}

# Iterate over each year and create a bar plot for primary breeds, excluding NAs and 'Unknown'
for (year in unique_years) {
  yearly_data <- new_df %>%
    filter(KeyDateYear == year, (DogAge >= 0 & DogAge <= 15) | is.na(DogAge), PrimaryBreed != "Unknown") %>%
    group_by(PrimaryBreed) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
    arrange(desc(TotalDogs)) %>%  # Arrange by TotalDogs in descending order
    slice_max(order_by = TotalDogs, n = 10)  # Select top 10 PrimaryBreeds based on TotalDogs

  # Assign colors to PrimaryBreed
  breed_color_map <- assign_colors(yearly_data)
  
  # Plot for the year showing top 10 primary breeds
  p <- ggplot(yearly_data, aes(x = reorder(PrimaryBreed, TotalDogs), y = TotalDogs, fill = PrimaryBreed)) +
    geom_col() +  # Regular bar plot
    scale_fill_manual(values = breed_color_map) +  # Set manual colors for PrimaryBreed
    geom_hline(yintercept = c(250, 500), linetype = "dashed", color = "red") +  # Adding dashed lines at specified y-values
    scale_y_continuous(limits = c(0, max(750, max(yearly_data$TotalDogs, na.rm = TRUE))), breaks = seq(0, max(750, max(yearly_data$TotalDogs, na.rm = TRUE)), by = 250)) +  # Adjust y-axis based on max dog count
    theme_minimal() +
    labs(title = paste("Top 10 Primary Breeds by Total Count of Dogs in", year),
         x = "Primary Breed",
         y = "Total Number of Dogs") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

  print(p)  # Display the plot
}

##11.3. Including Unknown by district

R script designed to analyze and visualize the distribution of the top five primary dog breeds in each district annually.

# Data preprocessing with new transformations for new_df
new_df <- new_df %>%
  mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
         KeyDateYear = as.numeric(as.character(KeyDateYear)),
         DogAge = KeyDateYear - DogBirthYear)

# Ensure 'DistrictSort' is set as a factor with levels from 1 to 12 for proper ordering
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))

# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)

# Function to assign colors to PrimaryBreed
assign_colors <- function(data) {
  n_breeds <- length(unique(data$PrimaryBreed))
  palette <- scales::hue_pal()(n_breeds)
  breed_color_map <- setNames(palette, unique(data$PrimaryBreed))
  return(breed_color_map)
}

# Iterate over each year and create a bar plot for districts 1 through 12, excluding NAs
for (year in unique_years) {
  yearly_data <- new_df %>%
    filter(KeyDateYear == year, !is.na(DistrictSort), (DogAge >= 0 & DogAge <= 15) | is.na(DogAge)) %>%
    group_by(DistrictSort, PrimaryBreed) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
    arrange(DistrictSort, desc(TotalDogs)) %>%  # Arrange by DistrictSort and TotalDogs in descending order
    group_by(DistrictSort) %>%
    top_n(6, wt = TotalDogs) %>%  # Select top 5 PrimaryBreed for each DistrictSort based on TotalDogs
    ungroup()
  
  # Assign colors to PrimaryBreed
  breed_color_map <- assign_colors(yearly_data)
  
  # Plot for the year with districts 1 through 12
  p <- ggplot(yearly_data, aes(x = DistrictSort, y = TotalDogs, fill = PrimaryBreed)) +
    geom_col(position = "stack") +  # Stacked bar plot
    scale_fill_manual(values = breed_color_map) +  # Set manual colors for PrimaryBreed
    geom_hline(yintercept = c(250, 500), linetype = "dashed", color = "red") +  # Adding dashed lines at specified y-values
    scale_y_continuous(limits = c(0, 750), breaks = seq(0, 750, by = 250)) +  # Standardize y-axis up to 2000
    theme_minimal() +
    labs(title = paste("Top 5 Primary Breeds by Total Count of Dogs in", year),
         x = "District",
         y = "Total Number of Dogs") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
  print(p)  # Display the plot
}

11.4. Excluding Unknown by district

Explicitly excluding any entries categorized under “Unknown.” This refined focus allows for a more precise and meaningful understanding of breed popularity and distribution, crucial for targeted animal welfare and urban planning strategies.

# Data preprocessing with new transformations for new_df
new_df <- new_df %>%
  mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
         KeyDateYear = as.numeric(as.character(KeyDateYear)),
         DogAge = KeyDateYear - DogBirthYear)

# Ensure 'DistrictSort' is set as a factor with levels from 1 to 12 for proper ordering
new_df$DistrictSort <- factor(new_df$DistrictSort, levels = as.character(1:12))

# Extract unique years for iteration
unique_years <- unique(new_df$KeyDateYear)

# Function to assign colors to PrimaryBreed
assign_colors <- function(data) {
  n_breeds <- length(unique(data$PrimaryBreed))
  palette <- scales::hue_pal()(n_breeds)
  breed_color_map <- setNames(palette, unique(data$PrimaryBreed))
  return(breed_color_map)
}

# Iterate over each year and create a bar plot for districts 1 through 12, excluding NAs and "Unknown" breeds
for (year in unique_years) {
  yearly_data <- new_df %>%
    filter(KeyDateYear == year, !is.na(DistrictSort), (DogAge >= 0 & DogAge <= 15) | is.na(DogAge), PrimaryBreed != "Unknown") %>%
    group_by(DistrictSort, PrimaryBreed) %>%
    summarize(TotalDogs = sum(NumberOfDogs), .groups = 'drop') %>%
    arrange(DistrictSort, desc(TotalDogs)) %>%  # Arrange by DistrictSort and TotalDogs in descending order
    group_by(DistrictSort) %>%
    top_n(6, wt = TotalDogs) %>%  # Select top 5 PrimaryBreed for each DistrictSort based on TotalDogs
    ungroup()

  # Assign colors to PrimaryBreed
  breed_color_map <- assign_colors(yearly_data)

  # Plot for the year with districts 1 through 12
  p <- ggplot(yearly_data, aes(x = DistrictSort, y = TotalDogs, fill = PrimaryBreed)) +
    geom_col(position = "stack") +  # Stacked bar plot
    scale_fill_manual(values = breed_color_map) +  # Set manual colors for PrimaryBreed
    geom_hline(yintercept = c(250, 500), linetype = "dashed", color = "red") +  # Adding dashed lines at specified y-values
    scale_y_continuous(limits = c(0, 750), breaks = seq(0, 750, by = 250)) +  # Standardize y-axis up to 750
    theme_minimal() +
    labs(title = paste("Top 6 Primary Breeds by Total Count of Dogs in", year),
         x = "District",
         y = "Total Number of Dogs") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

  print(p)  # Display the plot
}

12.Tracking Aging Dogs from 2015 to 2023

12.1. Including Unknown Data

To analyze the aging patterns of dogs from 2015 to 2023, we first preprocess the dataset to ensure the necessary variables are in the appropriate format. This includes converting DogAgeGroupCd to numeric and KeyDateYear to numeric. After this preprocessing step, we proceed with the following steps:

Identifying Dogs Present in 2015 and 2023: We filter the dataset to extract information about dogs present in 2015 and 2023, focusing on their OwnerId, PrimaryBreed, and SecondaryBreed.

Finding Dogs Present in Both Years: We find the intersection of dogs present in 2015 and 2023 to identify those that survived from 2015 to 2023.

Filtering Dataset for Surviving Dogs: Using the information obtained from the previous step, we filter the original dataset to retain records of dogs present in both 2015 and 2023.

Checking for Consistent Age Progression: We calculate the age difference for each dog between 2015 and 2023 and identify dogs with a consistent age progression of 8 years, assuming that DogAge reflects each dog’s age accurately.

RESULT: This data suggests that many dogs observed in both 2015 and 2023 have aged consistently by 8 years, indicating a typical aging pattern over the observed period.

# Convert to numeric and prepare new_df
new_df <- new_df %>%
  mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
         KeyDateYear = as.numeric(as.character(KeyDateYear)),
         DogAge = KeyDateYear - DogBirthYear)

# Step 1: Identify dogs present in 2015
dogs_in_2015 <- new_df %>%
  filter(KeyDateYear == 2015) %>%
  select(OwnerId, PrimaryBreed, SecondaryBreed)

# Step 2: Identify dogs present in 2023
dogs_in_2023 <- new_df %>%
  filter(KeyDateYear == 2023) %>%
  select(OwnerId, PrimaryBreed, SecondaryBreed)

# Step 3: Intersect the two groups to find dogs present in both years
surviving_dogs <- intersect(dogs_in_2015, dogs_in_2023)

# Filter the original dataset for these dogs and years
surviving_dogs_records <- new_df %>%
  semi_join(surviving_dogs, by = c("OwnerId", "PrimaryBreed", "SecondaryBreed")) %>%
  filter(KeyDateYear %in% c(2015, 2023))

# Check for consistent age progression for these dogs
surviving_dogs_age_check <- surviving_dogs_records %>%
  group_by(OwnerId, PrimaryBreed, SecondaryBreed) %>%
  summarise(AgeDifference = diff(sort(DogAge)), .groups = "drop")
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Dogs with consistent age progression
consistent_age_progression <- surviving_dogs_age_check %>%
  filter(AgeDifference == 8) 

print(consistent_age_progression)
## # A tibble: 1,670 × 4
##    OwnerId PrimaryBreed        SecondaryBreed AgeDifference
##    <chr>   <chr>               <chr>                  <dbl>
##  1 100002  Chihuahua           none                       8
##  2 100018  Zwergschnauzer      none                       8
##  3 100030  Chihuahua           none                       8
##  4 100034  Unknown             Unknown                    8
##  5 100052  Jack Russel Terrier none                       8
##  6 100060  Chihuahua           none                       8
##  7 100147  Unknown             Unknown                    8
##  8 100181  Malteser            none                       8
##  9 100181  Zwergpudel          none                       8
## 10 100184  Mops                none                       8
## # ℹ 1,660 more rows

12.2. Excluding Unknown Data

R code used to track the aging progression of dogs from 2015 to 2023 while excluding those with an ‘Unknown’ primary breed. This process involves identifying dogs present in both 2015 and 2023, filtering out those with an ‘Unknown’ primary breed, and then checking for consistent age progression among these dogs. The results are printed to examine dogs with an expected age difference of 8 years between 2015 and 2023.

RESULT: ID 100002 has a Chihuahua with no secondary breed, and the age of this dog has increased by 8 years from 2015 to 2023. Similarly, each subsequent row provides information about different dogs owned by different individuals, their breeds, and the corresponding age differences over the specified period.

# Convert to numeric and prepare new_df
new_df <- new_df %>%
  mutate(DogBirthYear = as.numeric(as.character(DogBirthYear)),
         KeyDateYear = as.numeric(as.character(KeyDateYear)),
         DogAge = KeyDateYear - DogBirthYear)

# Step 1: Identify dogs present in 2015, excluding 'Unknown' breeds
dogs_in_2015 <- new_df %>%
  filter(KeyDateYear == 2015, PrimaryBreed != "Unknown", SecondaryBreed != "Unknown") %>%
  select(OwnerId, PrimaryBreed, SecondaryBreed)

# Step 2: Identify dogs present in 2023, excluding 'Unknown' breeds
dogs_in_2023 <- new_df %>%
  filter(KeyDateYear == 2023, PrimaryBreed != "Unknown", SecondaryBreed != "Unknown") %>%
  select(OwnerId, PrimaryBreed, SecondaryBreed)

# Step 3: Intersect the two groups to find dogs present in both years
surviving_dogs <- intersect(dogs_in_2015, dogs_in_2023)

# Filter the original dataset for these dogs and years
surviving_dogs_records <- new_df %>%
  semi_join(surviving_dogs, by = c("OwnerId", "PrimaryBreed", "SecondaryBreed")) %>%
  filter(KeyDateYear %in% c(2015, 2023))

# Check for consistent age progression for these dogs
surviving_dogs_age_check <- surviving_dogs_records %>%
  group_by(OwnerId, PrimaryBreed, SecondaryBreed) %>%
  summarise(AgeDifference = diff(sort(DogAge)), .groups = "drop")
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Dogs with consistent age progression
consistent_age_progression_2 <- surviving_dogs_age_check %>%
  filter(AgeDifference == 8) 

print(consistent_age_progression_2)
## # A tibble: 1,340 × 4
##    OwnerId PrimaryBreed        SecondaryBreed AgeDifference
##    <chr>   <chr>               <chr>                  <dbl>
##  1 100002  Chihuahua           none                       8
##  2 100018  Zwergschnauzer      none                       8
##  3 100030  Chihuahua           none                       8
##  4 100052  Jack Russel Terrier none                       8
##  5 100060  Chihuahua           none                       8
##  6 100181  Malteser            none                       8
##  7 100181  Zwergpudel          none                       8
##  8 100184  Mops                none                       8
##  9 100192  Gordon Setter       none                       8
## 10 100220  Chihuahua           none                       8
## # ℹ 1,330 more rows

13. Analyzing the Top Dog Breeds from 2015 to 2023

These codes calculates and visualizes the count of surviving dogs from 2015 to 2023 for the top 10 breeds based on their occurrence across all years.

13.1. Including Unknown Data

By first determining the breeds with the highest occurrence across all years, it ensures a comprehensive selection. Then, filtering the dataset for dogs with consistent age progression, it narrows down the focus to these top breeds. The resulting bar plot vividly illustrates the distribution of surviving dogs among these breeds, providing valuable insights into their prevalence over the specified timeframe.

# Calculate the top 10 breeds based on their occurrence across all years in new_df
top_10_breeds <- new_df %>%
  count(PrimaryBreed, sort = TRUE) %>%
  slice_max(order_by = n, n = 10) %>%
  pull(PrimaryBreed)

# Assuming 'consistent_age_progression' is a subset of new_df reflecting dogs from 2015 to 2023
# that survived and are in the top 10 breeds
consistent_age_progression <- new_df %>%
  filter(PrimaryBreed %in% top_10_breeds, KeyDateYear %in% c(2015, 2023)) %>%
  # Add any additional filters here for survival or presence in both years if not already filtered
  group_by(OwnerId, PrimaryBreed) %>%
  filter(n() > 1) %>%
  ungroup()

# Pre-calculate the counts for plotting and ordering
top_breeds_counts <- consistent_age_progression %>%
  count(PrimaryBreed) %>%
  arrange(desc(n))

# Visualize the count of surviving dogs from 2015 to 2023 in the top 10 breeds
ggplot(top_breeds_counts, aes(x = reorder(PrimaryBreed, n), y = n, fill = PrimaryBreed)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Surviving Dogs from 2015 to 2023 by Top 10 Breeds",
       x = "Breed", y = "Count of Surviving Dogs") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis_d() +
  theme(legend.position = "none")

13.2. Excluding Unknown Data

This code segment performs a refined analysis of aging dogs from 2015 to 2023, focusing specifically on the top 10 breeds while excluding any entries labeled as ‘Unknown’. By filtering out these unknown entries from the dataset before calculating the breed occurrences, it ensures a more accurate representation of the prevailing breeds.

# Calculate the top 10 breeds based on their occurrence across all years in new_df
top_10_breeds <- new_df %>%
  count(PrimaryBreed, sort = TRUE) %>%
  filter(PrimaryBreed != "Unknown") %>%
  slice_max(order_by = n, n = 10) %>%
  pull(PrimaryBreed)

# Assuming 'consistent_age_progression' is a subset of new_df reflecting dogs from 2015 to 2023
# that survived and are in the top 10 breeds
consistent_age_progression <- new_df %>%
  filter(PrimaryBreed %in% top_10_breeds, KeyDateYear %in% c(2015, 2023)) %>%
  # Add any additional filters here for survival or presence in both years if not already filtered
  group_by(OwnerId, PrimaryBreed) %>%
  filter(n() > 1) %>%
  ungroup()

# Pre-calculate the counts for plotting and ordering
top_breeds_counts <- consistent_age_progression %>%
  count(PrimaryBreed) %>%
  arrange(desc(n))

# Visualize the count of surviving dogs from 2015 to 2023 in the top 10 breeds
ggplot(top_breeds_counts, aes(x = reorder(PrimaryBreed, n), y = n, fill = PrimaryBreed)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Surviving Dogs from 2015 to 2023 by Top 10 Breeds",
       x = "Breed", y = "Count of Surviving Dogs") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis_d() +
  theme(legend.position = "none")